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Abstract:  As  per  the  hurriedly  ever-increasing 

attractiveness  of  Data  mining  in  different  firms  and 
association  for  example  banking,  medicine,  scientific 
research  and  among  government  agencies  there  are 
numerous  possible  methods  are  obtainable  for  data 
examination.  It  allows  users  to  examine  data  from  different 
angles  and  from  different  dimensions,  combining  it,  and 
summarize  the  relationships  recognized.  We  have  to  store  a 
datasets  in  particular  format  of  one  of  the  particular  store 
and  data  stored  in  text  database  it  should  be  in  json,  csv, 
XML  format.  We  all  know  this  store  datasets  are  in  the 
form  of  text  datasets  which  is  neither  be  a unstructured  nor 
completely  structured  means  it  is  a semi -structured 
datasets.  As  per  our  study  in  previous  review  paper  we  have 
implemented  this  semi -structured  datasets  in  this  proposed 
implementation  paper.  Data  classification  or  extraction 
process  is  done  to  handle  the  un-structured  data  such  as 
abstract  and  contents  in  the  book.  So  in  our  proposed 
implementation  paper  we  are  using  Apriori -Hybrid 
algorithm  which  is  combination  of  weighed  Apriori  and 
hash  tree  algorithm  for  preceding  our  datasets  for  search 
result.  Moreover  this  obtained  result  is  further  proceed 
toward  FDM  (Fast  distributed  algorithm)  for  comparison 
purpose. 

Keyword:  Data  mining,  frequent  item  sets,  Apriori 
hybrid  algorithm,  FDM  (fast  distributed  mining 
algorithm),  java  parser. 


1.  INTRODUCTION 

As  per  our  research  and  our  gathered  knowledge 
from  various  research  work  we  recognized  that  the  data 
mining  is  most  important  task  which  needed  in  different 
firms  and  organization,  in  business  purpose,  in  government 
agencies  etc.  This  means  that  data  mining  is  most  research 
area  where  we  can  do  the  more  invention  for  mining  data 


from  the  datasets.  This  is  extracting  or  classifying  the 
needed  data  from  the  huge  datasets.  We  can  assume  a data 
mining  process  as,  there  is  a huge  container  like  structure  in 
which  there  will  be  a large  amount  of  data  or  usually 
datasets  are  maintained  or  stored.  Data  extraction  or 
knowledge  acquaintance  is  a process  of  extracting  or 
invoking  the  data  from  that  huge  container.  Pursuing 
knowledge  from  huge  quantity  of  data  is  one  of  the  most 
favored  features  of  Data  Mining. 

This  adaptation  won't  occur  instinctively,  that's 
where  Data  Mining  comes  into  representation.  In 
Experimental  Data  Analysis,  some  preliminary 
acquaintance  is  known  about  the  data  means  we  have 
fractional  information  about  particular  data,  but  Data 
Mining  could  help  in  a more  intensity  acquaintance  about 
the  data  by  exploring  the  data  in  detail  and  its  similar 
information  is  also  get  by  data  mining. 

A number  of  Data  Mining  techniques  such  as 
association,  clustering,  classification  are  come  in  account  of 
us  to  mine  this  vast,  huge  amount  of  data.  These  techniques 
have  their  own  advantages  and  features.  As  we  can  describe 
one  by  one  distinctiveness  of  this  techniques, 

1.  Association  rule  mining: 

The  aim  of  association  rule  discovery  is  to  find  associations 
among  items  from  a set  of  relations,  each  of  which  contains 
a set  of  items  as  per  called  it  as  dataset.  Not  all  of  the 
association  rules  discovered  within  an  operation  set  are 
exciting  or  beneficial.  Usually  the  algorithm  finds  a subset 
of  association  rules  that  fulfill  certain  restriction. 

2.  Clustering: 

In  this  method  of  data  mining  all  the  correlated  substance  in 
a datasets  is  cluster  together  that  means  this  method 
congregate  or  group  many  items  which  are  close  to  each 
other.  This  form  of  mining  method  is  known  as  clustering. 

3.  Classification: 

Classification  is  a procedure  in  which  the  data  is  extracted 
or  selected  by  passing  query  and  as  per  necessities  of  user 
query  the  data  is  extracted  or  mine. 
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Know  as  we  are  talking  about  the  data  mining  task 
on  datasets  there  will  be  a different  type  datasets  are 
present.  When  we  are  talking  about  text  type  datasets  the 
data  present  in  the  data  container  is  neither  in  completely 
structured  nor  un-structured  which  is  known  as  Semi- 
structured  datasets.  For  example,  a document  may  contain  a 
few  structured  fields,  such  as  title,  authors,  publication 
date,  length,  and  category,  and  so  on,  but  also  contain  some 
largely  unstructured  text  components,  such  as  abstract  and 
contents. 

But,  old-fashioned  Information  Retrieval 
procedure  becomes  inadequate  for  the  rapid  progressively 
vaster  amount  of  text  data.  Characteristically,  only  a small 
sector  of  the  many  available  documents  will  be  suitable  to  a 
given  character  or  user.  Without  significant  what  could  be 
in  the  documents,  it  is  complicated  to  express  effective 
queries  for  analyzing  and  extracting  useful  information 
from  the  data.  Users  need  tools  to  associate  different 
documents,  rank  the  importance  and  consequence  of  the 
documents,  or  find  patterns  and  trends  across  multiple 
documents.  Thus,  Text  Mining  has  become  a more  and 
more  popular  and  indispensable  theme  in  Data  Mining. 
This  text  mining  saves  a lot  of  effort  of  persons.  In  this 
paper  we  were  proposing  java  parser  to  perform  the 
withdrawal  on  the  large  data  sets  from  that  extraction  we 
got  the  predictable  result  of  datasets  that  we  want  a 
compulsory  and  then  further  advance  toward  APRIORI  - 
Hybrid  algorithm  which  is  formed  by  merging  weighted 
apriori  and  T Hash  apriori.  After  this  result  get  by 
algorithm  is  transferred  to  FDM  association  rule  mining 
algorithm  for  judgment  of  efficiency,  frequency,  memory 
consumption  and  other  parameters. 


2.  LITERATURE  WORK 

After  reviewing  work  of  other  professional  developers  we 
recognized  that  data  mining  work  has  endless  future  for  its 
research  in  this  mining  world.  So  a procedure  for  secure 
mining  of  association  rules  in  horizontally  distributed 
databases  that  improve  expressively  upon  the  current 
leading  protocol  in  terms  of  confidentiality  and 
effectiveness  are  described  by  TamirTassa  [1].  Merry  KP, 
Rabindra  Kumar  Singh,  Swaroop.S. Kumar  [2]  has  aim  is 
made  to  classify  standard  colon  cancer  microarray  dataset 
using  Association  rule  mining  algorithm,  namely  Apriori - 
Hybrid.  D.W.L.  Cheung,  J.  Han,  V.T.Y.  Ng,  A.W.C.  Fu, 
and  Y.  Fu,  [3]  has  there  an  efficient  arrangement  of  the 
Apriori-Hybrid  algorithm  over  Apriori.  R.  Agrawal  and  R. 
Srikant  [4]  has  represents  large  database  of  customer 
transactions.  Each  transaction  maintains  of  items  acquired 
by  a customer  in  a stay.  Rakesh  Agrawal  Tomasz 
Imielinski  Arun  Swami  [5]  has  describes  the  complexity  of 


mining  generalized  association  rules.  ZijianZheng,  Ron 
Kohavi,  and  Llew  Mason  [6]  has  exploit  an  Association 
Rule  mining  and  suggests  an  algorithm  that  associated  the 
simple  association  rules  resulting  from  basic  Apriori 
Algorithm  with  the  numerous  minimum  maintenance  using 
maximum  constraints.  Jiawei  Han,  Jian  Pei,  Yiwen  Yin  and 
Runying  Mao  [7]  has  recommend  new  Frequent-Pattern 
tree  ( FP  tree)  structure.  FionnMurtagh  Mohsen  Farid  [8] 
has  explain  Rare  association  rules  are  those  that  only  seem 
hardly  ever  even  though  they  are  highly  associated  with 
very  accurate  data. 


3.  PROPOSED  METHOD 

Now  by  bearing  in  mind  their  work  and  by 
keeping  in  mind  all  the  ideas  they  has  developed  regarding 
association  rule  we  are  going  to  recommend  association 
rule  mining  using  APRIORI-hybrid  algorithm.  After 
applying  Apriori-Hybrid  algorithm  we  precede  the 
outputted  result  for  comparison  purpose  to  the  next 
algorithm  of  association  rule  that  is  FDM  (Fast  distributed 
mining)  to  compare  efficiency,  frequency,  memory 
consumption  and  other  parameters.  As  we  know  the  data 
stored  in  a text  database  is  semi-structured  that  means  it  is 
not  completely  structured  as  there  in  a normal  database 
data.  And  not  completely  unstructured  means  a complete 
huge  continues  related  data. 

If  we  consider  example  such  as  book  related  text 
dataset,  in  that  title,  author  name,  etc  data  are  in  structured 
form  and  abstract  and  contents  of  book  are  in  unstructured 
form.  So  mining  such  semi -structured  data  for  pattern 
analysing  for  intensity  knowledge  of  the  pattern  item  sets  is 
nearly  difficult  to  search  in  today’s  required  huge  datasets. 
So  we  are  projecting  in  this  paper  Apriori-Hybrid 
algorithm.  First  we  parse  a huge  datasets  by  our  java  parser 
to  extract  required  item-sets  from  a datasets.  Then  apply 
Apriori-hybrid  algorithm  for  association  mining.  And  then 
we  compare  both  acquired  results  for  accuracy,  efficiency 
and  other  parameters  by  hybrid  algorithm  and  FDM  (Fast 
distribution  mining)  algorithm  and  represent  it  graphically 
to  show  the  result  of  our  output  on  comparison  with  other 
association  mining  methods. 

APRIORI_HYBRID  Algorithm: 

Apriori-Hybrid  algorithm  is  combination  of 
Weighted  Apriori  Algorithm  and  Apriori  Hash  T algorithm. 
By  considering  both  the  algorithms  beneficial  points  we 
have  decided  to  use  this  algorithm  to  mine  such  difficult 
text  type  semi-structured  dataset.  Based  on  the  previous 
work,  it  has  been  proposed  to  overcome  the  drawbacks  of 
existing  algorithms. 
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Benefits  of  the  hybrid  approach 

1.  It  will  decrease  the  calculation  time  as  well  as  the 
accurateness  of  the  numerous  item  predictions. 

2.  It  will  avoid  the  candidate  set  production. 

3.  It  will  produce  the  tree  configuration  and  examine  their 
height,  weight  and  reach  ability  for  each  knot. 

4.  The  work  absolutely  gives  best  conclusion  for  the  mining 
of  recurrent  item  sets  with  the  state-of-art  technologies 
like  hadoop  hdfs,  vmware  and  eucalyptus  platforms. 

5.  Such  an  arrangement  of  modified  algorithm  in  hadoop 
VMWare  environment  doesn’t  exist  earlier  with  reverence 
to  the  literature  survey. 

Weighted  Apriori: 

While  using  the  Apriori  algorithm  each  operation 
in  the  datasets  is  measured  as  a record  and  each  dataset 
record  is  the  item.  The  Apriori  is  the  classic  algorithm  that 
is  most  extensively  used  for  the  invention  of  association 
rules.  There  are  mainly  two  steps  in  association  rule 
mining,  first  to  find  all  the  recurrently  occurred  sets  in  the 
datasets,  by  using  the  recurrent  item- set  produce  strong 
association  rules  that  assure  the  minimum  support  and  self- 
assurance  value. 

The  Apriori  is  based  on  the  record  sets  that  are 
regularly  occurred  in  the  dataset.  Apriori  algorithm  is 
essentially  a layer-by-layer  iterative  penetrating  algorithm, 
where  k-itemset  is  used  to  discover  the  (k+  l)-itemset.  It 
first  scans  the  datasets  to  find  number  of  occurrences  of 
each  item  on  the  given  dataset.  The  Itemsets  which  assure 
the  minimum  support  and  the  confidence  value  is  set  as  the 
recurrent  1 - itemset. 

Now,  the  frequent  1 -Itemsets  were  coupled 
together  to  form  the  k - itemset.  The  permutation  of  the 
generated  candidate  set,  the  pruning  step  is  taken  place, 
which  will  remove  the  inappropriate  items,  that  does  not 
assure  the  minimum  support  and  the  confidence  value.  The 
weighted  Apriori  as  such  follows  the  classic  rules;  apart 
from  provide  the  weights  to  each  data  item. 

The  weights  were  allocated  in  such  a way  that, 
after  the  candidate  item  sets  production,  the  disjointing  of 
the  attribute  taken  place.  Each  itemset  is  provided  with 
weights  that  are  minimum,  average  and  maximum  value  for 
the  itemset  for  easier  splitting  up  of  datasets.  Based  on  the 
weights  threshold  the  dataset  item  is  pruned,  and  their 
association  rules  were  produced.  Based  on  the  generated 
association  rules,  regularly  occurred  Itemsets  were  easily 
mined. 

Disadvantages: 

1.  Itemset  arrangement  will  be  generated  recurrently.  It  will 
amplify  the  candidate  itemsets. 

2.  The  weight  calculation  for  the  each  transaction  will  take 
more  time  to  accomplish. 

3.  Less  accurateness. 

4.  Not  depend  on  data  divergence. 


Hash  Tree  Apriori 

Our  hash  based  Apriori  achievement,  uses  a data 
structure  that  directly  stand  for  a hash  table.  This  algorithm 
proposes  overcoming  some  of  the  weak  points  of  the 
Apriori  algorithm  by  dropping  the  number  of  candidate  k- 
itemsets.  In  particular  the  2-itemsets,  since  that  is  the  key  to 
improving  presentation.  This  algorithm  uses  a hash  based 
technique  to  condense  the  number  of  candidate  itemsets 
produced  in  the  first  pass.  It  is  asserted  that  the  number  of 
itemsets  in  C2  produced  using  hashing  can  be  minor,  so 
that  the  scan  essential  to  establish  L2  is  more  efficient. 

Algorithm: 

1.  Scan  all  the  transaction.  Create  possible  2-itemsets. 

2.  Let  the  Hash  table  of  size  8. 

3.  Lor  each  bucket  assign  a candidate  pairs  using  the  ASCII 
values  of  the  itemsets. 

4.  Each  bucket  in  the  hash  table  has  a count,  which  is 
improved  by  1 each  item  an  item  set  is  hashed  to  that 
bucket. 

5.  If  the  bucket  count  is  equal  or  above  the  minimum 
support  count,  the  bit  vector  is  set  to  1 . Otherwise  it  is  set  to 
0. 

6.  The  candidate  pairs  that  hash  to  locations  where  the  bit 
vector  bit  is  not  set  are  detached. 

7.  Adjust  the  transaction  datasets  to  include  only  these 
candidate  pairs. 

Disadvantages: 

1 . High  calculations  requirement. 

2.  High  memory  consumption. 

3.  Knots  dispensation  requires  high  time  to  compute. 

4.  IMPLEMENTATION  FLOW: 


Result 
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Modules: 

1 . Login  mechanism. 

2.  Upload  the  dataset  file. 

3.  Parse  the  dataset  file  using  Java  Parsers. 

4.  Process  the  dataset  file  after  parsing  using  Apriori 
Hybrid  Algorithm  (Algorithm  Implementation). 

5.  Search. 

6.  Analytics  (Graphs). 

7 Comparison  with  any  other  Algorithm  to  identify  its 
efficiency 
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This  will  show  the  result  of  both  algorithms. 
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After  importing,  datasets  will  display. 


Finally  the  system  shows  the  graph  based  result  of  both  the 
algorithm  with  respect  to  their  memory,  time  computation, 
and  CPU  utilization. 
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6.  CONCLUSIONS 

In  the  implemented  system  the  goal  of  extracting  the 
interesting  patterns  from  the  largely  content  of  datasets 
for  the  purpose  of  discovering  the  knowledge  is  get 
achieved.  So  the  advantages  of  Apriori  hybrid 
algorithm  considering  or  merging  both  weighted 
apriori  and  hash  based  apriori  algorithm  are  worth  full 
for  our  system  configuration.  This  will  result  in  less 
memory  utilization,  less  time  computation  and  less 
CPU  utilization  while  mining  the  datasets  for 
knowledge  discovery  purpose.  Moreover  the  system 
shows  it  graphically. 
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